NSF PAR Search | NSF Public Access Repository

On the Complexity and Typology of Inflectional Morphological Systems

https://doi.org/10.1162/tacl_a_00271

Cotterell, Ryan; Kirov, Christo; Hulden, Mans; Eisner, Jason (November 2019, Transactions of the Association for Computational Linguistics)

We quantify the linguistic complexity of different languages’ morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: A language’s inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm, that is, how hard it is to jointly predict all the word forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages.
Full Text Available
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Roark, Brian; Wolf-Sonkin, Lawrence; Kirov, Christo; Mielke, Sabrina J.; Johny, Cibu; Demirsahin, Isin; Hall, Keith (May 2020, Proceedings of the 12th Language Resources and Evaluation Conference)

Full Text Available
Unsupervised Disambiguation of Syncretism in Inflected Lexicons

https://doi.org/10.18653/v1/N18-2087

Cotterell, Ryan; Kirov, Christo; Mielke, Sebastian J.; Eisner, Jason (June 2018, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT))

Full Text Available
Neural Polysynthetic Language Modelling

Schwartz, Lane; Tyers, Francis; Levin, Lori; Kirov, Christo; Littell, Patrick; Lo, Chi-kiu; Prud'hommeaux, Emily; Park, Hyunji Hayley; Steimel, Kenneth; Knowles, Rebecca; et al (May 2020, arXiv.org)

Many techniques in modern computational linguistics and natural language processing (NLP) make the assumption that approaches that work well on English and other widely used European (and sometimes Asian) languages are “language agnostic” – that is, that they will also work across the typologically diverse languages of the world. In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically distinct variants of a common root (such as dog and dogs) as completely independent word types. Doing so relies on two main assumptions: that there exist a limited number of morphological inflections for any given root, and that most or all of those variants will appear in a large enough corpus (conditioned on assumptions about domain, etc.) so that the model can adequately learn statistics about each variant. Approaches like stemming, lemmatization, morphological analysis, subword segmentation, or other normalization techniques are frequently used when either of those assumptions is likely to be violated, particularly in the case of synthetic languages like Czech and Russian that have more inflectional morphology than English. Within the NLP literature, agglutinative languages like Finnish and Turkish are commonly held up as extreme examples of morphological complexity that challenge common modelling assumptions. Yet, when considering all of the world’s languages, Finnish and Turkish are closer to the average case in terms of synthesis. When we consider polysynthetic languages (those at the extreme of morphological complexity), even approaches like stemming, lemmatization, or subword modelling may not suffice. These languages have very high numbers of hapax legomena (words appearing only once in a corpus), underscoring the need for appropriate morphological handling of words, without which there is no hope for a model to capture enough statistical information about those words.
Moreover, many of these languages have only very small text corpora, substantially magnifying these challenges. To this end, we examine the current state of the art in language modelling, machine translation, and predictive text completion in the context of four polysynthetic languages: Guaraní, St. Lawrence Island Yupik, Central Alaskan Yup’ik, and Inuktitut. We have a particular focus on Inuit-Yupik, a highly challenging family of endangered polysynthetic languages that ranges geographically from Greenland through northern Canada and Alaska to far eastern Russia. The languages in this family are extraordinarily challenging from a computational perspective, with pervasive use of derivational morphemes in addition to rich sets of inflectional suffixes and phonological challenges at morpheme boundaries. Finally, we propose a novel framework for language modelling that combines knowledge representations from finite-state morphological analyzers with Tensor Product Representations (Smolensky, 1990) in order to enable successful neural language models capable of handling the full linguistic variety of typologically variant languages.
Full Text Available
Neural Polysynthetic Language Modelling

Schwartz, Lane; Tyers, Francis; Levin, Lori; Kirov, Christo; Littell, Patrick; Lo, Chi-kiu; Prud'hommeaux, Emily; Park, Hyunji Hayley; Steimel, Kenneth; Knowles, Rebecca; et al (May 2020, Final Report of the Frederick Jelinek Memorial Summer Workshop)
Full Text Available
UniMorph 3.0: Universal Morphology

McCarthy, Arya D.; Kirov, Christo; Grella, Matteo; Nidhi, Amrit; Xia, Patrick; Gorman, Kyle; Vylomova, Ekaterina; Mielke, Sabrina J.; Nicolai, Garrett; Silfverberg, Miikka; et al (May 2020, Proceedings of the 12th Language Resources and Evaluation Conference)

Full Text Available
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

https://doi.org/10.18653/v1/2020.sigmorphon-1.1

Vylomova, Ekaterina; White, Jennifer; Salesky, Elizabeth; Mielke, Sabrina J.; Wu, Shijie; Ponti, Edoardo Maria; Hall Maudslay, Rowan; Zmigrod, Ran; Valvoda, Josef; Toldova, Svetlana; et al (July 2020, Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology)

Full Text Available
